X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C8CD72.6290BF00@onstor-exch02.onstor.net>; Fri, 13 Jun 2008 09:27:34 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: Proposed design for new(ish) boot procedure for Cougar
Date: Fri, 13 Jun 2008 09:27:34 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8A06@onstor-exch02.onstor.net>
In-Reply-To: <20080612202832.5e5ff15b@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Proposed design for new(ish) boot procedure for Cougar
Thread-Index: AcjNBY7jAxHG5lQkSAitBSg9o+F/nAAbC2eQ
References: <20080612182458.010d3d89@ripper.onstor.net><02F5342D-628B-4BA3-B305-B499C3F49469@onstor.com> <20080612202832.5e5ff15b@ripper.onstor.net>
From: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>,
	"Ian Brown" <ian.brown@onstor.com>
Cc: "dl-Design Review" <dl-designreview@onstor.com>,
	"Brian Stark" <brian.stark@onstor.com>,
	"Warren Gale" <warren.gale@onstor.com>



>-----Original Message-----
>From: Andy Sharp
>Sent: Thursday, June 12, 2008 8:29 PM
>To: Ian Brown
>Cc: dl-Design Review; Brian Stark; Warren Gale
>Subject: Re: Proposed design for new(ish) boot procedure for Cougar
>
>On Thu, 12 Jun 2008 18:34:00 -0700 Ian Brown <ian.brown@onstor.com>
>wrote:
>
>> In production, for the Cheetah, we have always rebooted the entire
>> box.  There were some daemons that relied on boot up order, thus I'd
>> guess that you would need to restart the daemons in phase 1 if
>> you're going to just bounce an embedded core.
>
>That's good to know.  What little I know about Cheetah operation would
>likely fall into the "Lore" category.
>
>Phase I is still rebooting the whole box.  Depending on the results of
>testing, Phase II may never see the light of day. ~:^)
[MK]

There is no need to restart the daemons. During Cheetah development, the
daemons that cared about fp/txrx/fc restarts learned to listen for
slot/cpu up/down events and do the right thing. This used to work up to
3.2; after that I had to give up my Cheetah, so I can't vouch for it in
later releases.
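
To make the up/down handling concrete, here is a minimal Python sketch of
a daemon reacting to such events. Every name in it (CoreEvent, Daemon,
on_core_event) is hypothetical and only illustrates the pattern; it is
not the actual daemon code or event API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CoreEvent:
    """Hypothetical slot/cpu state-change notification."""
    slot: int
    cpu: int
    up: bool


@dataclass
class Daemon:
    """Hypothetical daemon that tracks which embedded cores are alive."""
    name: str
    live_cores: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def on_core_event(self, ev):
        key = (ev.slot, ev.cpu)
        if ev.up and key not in self.live_cores:
            # Core came (back) up: re-establish whatever state it needs.
            self.live_cores.add(key)
            self.log.append(("reconnect", key))
        elif not ev.up and key in self.live_cores:
            # Core went down: drop per-core state, keep the daemon running.
            self.live_cores.discard(key)
            self.log.append(("drop_state", key))


daemon = Daemon("example")
for ev in (CoreEvent(1, 0, True), CoreEvent(1, 0, False), CoreEvent(1, 0, True)):
    daemon.on_core_event(ev)
```

The point of the pattern is that a core bounce turns into a down event
followed by an up event, so the daemon itself never has to be restarted.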

>
>
>> Ian
>>
>> On Jun 12, 2008, at 6:24 PM, Andrew Sharp wrote:
>>
>>                        Cougar Boot Procedure Redesign
>>                        ______________________________
>>
>> Problem
>> =======
>>
>>     Booting takes far too long on Cougar, and in theory the embedded
>>     nodes should be rebootable w/o rebooting Linux on the Sibyte 1125.
>>
>> Reasons:
>>     1)    Image load from CF is intolerably slow
>>     2)    After image load, Linux boot takes the longest but is the
>>           least likely to need rebooting, resulting in an unnecessary
>>           bottleneck.
>>
>> Solution
>> ========
>>
>>     Redesign the boot flow to allow the embedded cores to be
>>     independently booted if Linux is up.
>>
>> Proposal
>> ========
>>
>>     Take a phased approach to implementing a redesigned boot
>>     procedure:
>>
>> 	Phase I
>> 	-------
>> 	1)  Change SSC PROM to load and boot only Linux.
>> 	2)  Change FP/TXRX PROM to write a magic cookie in a
>> 	    predefined memory location indicating its readiness
>> 	    for its image to be loaded.
>> 	3)  Implement an early start Linux daemon that waits for these
>> 	    boot magic cookies to be set by the embedded cores, loads
>> 	    their images to the correct memory locations, and signals
>> 	    to the FP/TXRX when finished.  The FP and TXRX could boot
>> 	    while Linux completes its boot steps.
>>
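
The cookie handshake in steps 2 and 3 can be sketched roughly as follows.
This is a Python simulation over a bytearray, not PROM or daemon code.
The cookie values, offsets, and function names are all assumptions made
for illustration; the proposal does not define any of them.

```python
import struct

# All constants below are hypothetical; the proposal does not specify them.
COOKIE_OFFSET = 0x0          # where the embedded core's PROM writes its cookie
COOKIE_READY = 0xC0FFEE01    # "ready for my image to be loaded"
COOKIE_LOADED = 0xC0FFEE02   # "image is in place, go ahead and boot"
IMAGE_OFFSET = 0x100         # where the core expects its image


def read_cookie(mem):
    return struct.unpack_from("<I", mem, COOKIE_OFFSET)[0]


def write_cookie(mem, value):
    struct.pack_into("<I", mem, COOKIE_OFFSET, value)


def prom_announce_ready(mem):
    """Step 2: the FP/TXRX PROM marks itself ready for an image."""
    write_cookie(mem, COOKIE_READY)


def loader_poll_once(mem, image):
    """Step 3: one polling pass of the early-start Linux daemon.

    Returns True if an image was loaded on this pass.
    """
    if read_cookie(mem) != COOKIE_READY:
        return False
    mem[IMAGE_OFFSET:IMAGE_OFFSET + len(image)] = image
    write_cookie(mem, COOKIE_LOADED)  # signal only after the copy completes
    return True


shared = bytearray(4096)             # stand-in for the shared memory window
image = b"fp-image-bytes"
first_pass = loader_poll_once(shared, image)   # core has not announced yet
prom_announce_ready(shared)
second_pass = loader_poll_once(shared, image)  # cookie seen, image copied
```

Writing the "loaded" cookie last matters: the embedded core must never
see the go-ahead before its image is fully in place.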
>> 	Phase II
>> 	--------
>> 	1)  Through testing, determine what needs to be done to allow
>> 	    FP/TXRX to be rebooted independently without disturbing
>> 	    the Linux kernel and each other.  Current daemons that
>> 	    communicate with FP/TXRX are not expected to be much
>> 	    trouble since they had to handle this for Cheetah,
>> 	    although this has not been extensively tested on Cheetah
>> 	    in the last few releases.
>>
>> Expected Results
>> ================
>>
>> Phase I
>> -------
>>
>> Current boot time           Predicted boot time        Predicted savings
>> -----------------           -------------------        -----------------
>> 2 minutes, 57 secs          1 minute, 43.7 secs        1 minute, 13.3 secs
>>
>> Roughly a 41% reduction in boot time: the current boot time* is 2:57
>> and the resulting boot time is estimated to be 1:43.7, a savings of
>> 1:13.3. Put another way, the new method would boot 1.7 times faster
>> (2 times faster, or twice as fast, would be a 50% reduction in boot
>> time).
>>
>> These estimates are based on a difference in FP/TXRX image load time:
>> 86 seconds from the PROM versus 12.7 seconds from Linux (cold cache).
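
The estimate can be reproduced from the three figures quoted above (2:57
current boot time, 86 seconds and 12.7 seconds image load). The short
Python check below uses only those numbers; the variable names and the
fmt helper are just for illustration.

```python
# Worked check of the Phase I estimate, using only the quoted figures.
current_boot_s = 2 * 60 + 57      # 2:57 measured boot time
prom_load_s = 86.0                # FP/TXRX image load time from the PROM
linux_load_s = 12.7               # same load done from Linux (cold cache)

savings_s = prom_load_s - linux_load_s
predicted_boot_s = current_boot_s - savings_s
reduction_pct = 100 * savings_s / current_boot_s
speedup = current_boot_s / predicted_boot_s


def fmt(seconds):
    """Format seconds as m:ss.s."""
    minutes, secs = divmod(round(seconds, 1), 60)
    return f"{int(minutes)}:{secs:04.1f}"


print("predicted boot:", fmt(predicted_boot_s))
print("savings:       ", fmt(savings_s))
print("reduction (%): ", round(reduction_pct, 1))
print("speedup (x):   ", round(speedup, 2))
```

The load-time difference is 86 - 12.7 = 73.3 seconds, which is what takes
2:57 down to the predicted 1:43.7.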
>>
>>
>> Phase II
>> --------
>> If just rebooting one or both of the FP/TXRX nodes, boot time is
>> estimated to be under 10 seconds.  This would substantially
>> increase customer satisfaction and supportability, and would
>> substantially increase developer efficiency.
>>
>>
>>
>>
>>
>> * Boot time measured from when PROM code starts loading the first boot
>> image to when nfxsh CLI is available.
>>
